ABSTRACT

Classical collaborative filtering and content-based filtering
methods try to learn a static recommendation model from
training data. These approaches are far from ideal in highly
dynamic recommendation domains such as news recommendation
and computational advertising, where the set of items and
users is very fluid. In this work, we investigate an adaptive
clustering technique for content recommendation based on
exploration-exploitation strategies in contextual multi-armed
bandit settings. Our algorithm takes into account the
collaborative effects that arise due to the interaction of the
users with the items, by dynamically grouping users based on
the items under consideration and, at the same time, grouping
items based on the similarity of the clusterings induced over
the users. The resulting algorithm thus takes advantage of
preference patterns in the data in a way akin to collaborative
filtering methods. We provide an empirical analysis on
medium-size real-world datasets, showing scalability and
increased prediction performance (as measured by click-through
rate) over state-of-the-art methods for clustering bandits. We
also provide a regret analysis within a standard linear
stochastic noise setting.
1. INTRODUCTION

Recommender systems are an essential part of many successful
on-line businesses, from e-commerce to on-line streaming, and
beyond. Moreover, computational advertising can be seen as a
recommendation problem where the user preferences highly
depend on the current context. In fact, many recommendation
domains such as YouTube video recommendation or news
recommendation do not fit the classical description of a
recommendation scenario, whereby a set of users with
essentially fixed preferences interact with a fixed set of
items. In this classical setting, the well-known cold-start
problem, namely, the lack of accumulated interactions by users
on items, needs to be addressed, for instance, by turning to
hybrid recommendation methods (e.g., [?]). In practice, many
relevant recommendation domains are dynamic, in the sense that
user preferences and the set of active users change with time.
Recommendation domains can be distinguished by how much and
how often user preferences and content universe change (e.g.,
[?]). In highly dynamic recommendation domains, such as news,
ads and videos, active users and user preferences are fluid,
hence classical collaborative filtering-type methods, such as
Matrix or Tensor Factorization, break down. In these settings,
it is essential for the recommendation method to adapt to the
shifting preference patterns of the users.

Exploration-exploitation methods, a.k.a. multi-armed bandits,
have been shown to be an excellent solution for these dynamic
domains (see, e.g., the news recommendation evidence in [?]).
While effective, standard contextual bandits do not take
collaborative information into account, that is, users who
have interacted with similar items in the past will not be
deemed to have similar taste based on this fact alone, while
items that have been chosen by the same group of users will
also not be considered as similar. It is this significant
limitation in the current bandit methodology that we try to
address in this work. Past efforts on this problem were based
on using online clustering-like algorithms on the graph or
network structure of the data in conjunction with multi-armed
bandit methods (see Section 3).

Commercial large-scale search engines and information
retrieval systems are examples of highly dynamic environments
where users and items could be described in terms of their
membership in some preference cluster. For instance, in a
music recommendation scenario, we may have groups of listeners
(the users) clustered around music genres, with the clustering
changing across different genres. On the other hand, the
individual songs (the items) could naturally be grouped by
sub-genre or performer based on the fact that they tend to be
preferred by the same group of users. Evidence has been
collected which suggests that, at least in specific
recommendation scenarios, like movie recommendation, data are
well modeled by clustering at both user and item sides (e.g.,
[?]).

In this paper, we introduce a Collaborative Filtering based
stochastic multi-armed bandit method that allows for a
flexible and generic integration of user-item interaction data
by alternately clustering over both user and item sides.
Specifically, we describe and analyze an adaptive and
efficient clustering-of-bandits algorithm that can perform
collaborative filtering, named COFIBA (pronounced as “coffee
bar”). Importantly, the clustering performed by our algorithm
relies on sparse graph representations, avoiding expensive
matrix factorization techniques. We adapt COFIBA to the
standard setting of sequential content recommendation known as
(contextual) multi-armed bandits (e.g., [?]) for solving the
canonical exploration vs. exploitation dilemma.

Our algorithm works under the assumption that we have to serve
content to users in such a way that each content item
determines a clustering over users made up of relatively few
groups (compared to the total number of users), within which
users tend to react similarly when that item gets recommended.
However, the clustering over users need not be the same across
different items. Moreover, when the universe of items is
large, we also assume that the items might be clustered as a
function of the clustering they determine over users, in such
a way that the number of distinct clusterings over users
induced by the items is also relatively small compared to the
total number of available items.

Our method aims to exploit collaborative effects in a bandit
setting in a way akin to the way co-clustering techniques are
used in batch collaborative filtering. Bandit methods are also
regarded in the recommender systems research community as one
of the most promising approaches, for instance in tackling the
cold-start problem (e.g., [25]), whereby the lack of data on
new users leads to suboptimal recommendations. An exploration
approach in these cases seems very appropriate.

We demonstrate the efficacy of our dynamic clustering
algorithm on three benchmark and real-world datasets. Our
algorithm is scalable and exhibits significantly increased
prediction performance over the state of the art in clustering
bandits. We also provide a $\sqrt{T}$-style regret analysis
holding with high probability in a standard stochastically
linear noise setting.

2. LEARNING MODEL 

We assume that the user behavior similarity is encoded by a
family of clusterings depending on the specific feature (or
context, or item) vector $x$ under consideration.
Specifically, we let $\mathcal{U} = \{1, \ldots, n\}$ represent
the set of $n$ users. Then, given $x \in \mathbb{R}^d$, set
$\mathcal{U}$ can be partitioned into a small number $m(x)$ of
clusters $U_1(x), U_2(x), \ldots, U_{m(x)}(x)$, where $m(x)$ is
upper bounded by a constant $m$, independent of $x$, with $m$
being much smaller than $n$. (The assumption $m \ll n$ is not
strictly required but it makes our algorithms more effective,
and this is actually what we expect our datasets to comply
with.) The clusters are such that users belonging to the same
cluster $U_j(x)$ tend to have similar behavior w.r.t. feature
vector $x$ (for instance, they both like or both dislike the
item represented by $x$), while users lying in different
clusters have significantly different behavior. The mapping
$x \to \{U_1(x), U_2(x), \ldots, U_{m(x)}(x)\}$ specifying the
actual partitioning of the set of users $\mathcal{U}$ into the
clusters determined by $x$ (including the number of clusters
$m(x)$ and its upper bound $m$), as well as the common user
behavior within each cluster, are unknown to the learning
system, and have to be inferred based on user feedback.

For the sake of simplicity, this paper takes the viewpoint
that clustering over users is determined by linear functions
$x \mapsto u_i^\top x$, each one parameterized by an unknown
vector $u_i \in \mathbb{R}^d$ hosted at user
$i \in \mathcal{U}$, in such a way that if users $i$ and $i'$
are in the same cluster w.r.t. $x$ then
$u_i^\top x = u_{i'}^\top x$, while if $i$ and $i'$ are in
different clusters w.r.t. $x$ then
$|u_i^\top x - u_{i'}^\top x| > \gamma$, for some (unknown) gap
parameter $\gamma > 0$, independent of $x$.^1 As in the
standard linear bandit setting (e.g., [?], [21], [34], [?],
[?], and references therein), the unknown vector $u_i$
determines the (average) behavior of user $i$. More concretely,
upon receiving context vector $x$, user $i$ “reacts” by
delivering a payoff value

$$a_i(x) = u_i^\top x + \epsilon_i(x),$$

where $\epsilon_i(x)$ is a conditionally zero-mean and bounded
variance noise term so that, conditioned on the past, the
quantity $u_i^\top x$ is indeed the expected payoff observed at
user $i$ for context vector $x$. Notice that the unknown
parameter vector $u_i$ we associate with user $i$ is supposed
to be time invariant in this model.^2
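
As a minimal sketch of the payoff model above (not the authors' code), the following Python snippet samples $a_i(x) = u_i^\top x + \epsilon_i(x)$ and checks whether two users fall in the same cluster w.r.t. a given $x$. The function names, the uniform noise, and the `noise_scale` parameter are illustrative assumptions; the model itself only asks for zero-mean noise with bounded variance.

```python
import random


def dot(u, x):
    """Inner product u^T x of two equal-length vectors."""
    return sum(ui * xi for ui, xi in zip(u, x))


def payoff(u_i, x, noise_scale=0.1, rng=random):
    """Sample a_i(x) = u_i^T x + eps, with eps zero-mean and bounded.

    Uniform noise on [-noise_scale, noise_scale] is an assumption made
    for illustration; any zero-mean, bounded-variance noise fits the model.
    """
    return dot(u_i, x) + rng.uniform(-noise_scale, noise_scale)


def same_cluster(u_i, u_j, x, gamma):
    """True if users i and j fall in the same cluster w.r.t. x.

    Under the gap assumption, expected payoffs either coincide or differ
    by more than gamma, so a difference below gamma means "same cluster".
    """
    return abs(dot(u_i, x) - dot(u_j, x)) < gamma


# Two users who agree along the first direction but not along the second.
u1, u2 = [1.0, 0.0], [1.0, 1.0]
print(same_cluster(u1, u2, [1.0, 0.0], gamma=0.5))  # equal expected payoffs
print(same_cluster(u1, u2, [0.0, 1.0], gamma=0.5))  # payoffs differ by 1.0
```

The example makes the item dependence of the clustering concrete: the same pair of users is grouped together for one item vector and separated for another.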

Since we are facing sequential decision settings where the
learning system needs to continuously adapt to the newly
received information provided by users, we assume that the
learning process is broken up into a discrete sequence of
rounds: In round $t = 1, 2, \ldots$, the learner receives a
user index $i_t \in \mathcal{U}$ to serve content to, hence the
user to serve may change at every round, though the same user
can recur many times. We assume the sequence of users
$i_1, i_2, \ldots$ is determined by an exogenous process that
places nonzero and independent probability on each user being
the next one to serve. Together with $i_t$, the system
receives in round $t$ a set of feature vectors
$C_{i_t} = \{x_{t,1}, x_{t,2}, \ldots, x_{t,c_t}\} \subseteq \mathbb{R}^d$
encoding the content which is currently available for
recommendation to user $i_t$. The learner is compelled to pick
some $\bar{x}_t = x_{t,k_t} \in C_{i_t}$ to recommend to $i_t$,
and then observes $i_t$'s feedback in the form of payoff
$a_t \in \mathbb{R}$ whose (conditional) expectation is
$u_{i_t}^\top \bar{x}_t$. The goal of the learning system is to
maximize its total payoff $\sum_{t=1}^T a_t$ over $T$ rounds.
When the user feedback at our disposal is only the
click/no-click behavior, the payoff $a_t$ is naturally
interpreted as a binary feedback, so that the quantity
$\frac{1}{T} \sum_{t=1}^T a_t$ becomes a click-through rate
(CTR), where $a_t = 1$ if the recommended item was clicked by
user $i_t$, and $a_t = 0$ otherwise. CTR is the measure of
performance adopted by our comparative experiments in
Section 5.

From a theoretical standpoint (see the regret analysis), we
are instead interested in bounding the cumulative regret
achieved by our algorithms. More precisely, let the regret
$r_t$ of the learner at time $t$ be the extent to which the
average payoff of the best choice in hindsight at user $i_t$
exceeds the average payoff of the algorithm's choice, i.e.,

$$r_t = \Bigl(\max_{x \in C_{i_t}} u_{i_t}^\top x\Bigr) - u_{i_t}^\top \bar{x}_t.$$

We aim to bound with high probability the cumulative regret
$\sum_{t=1}^T r_t$, the probability being over the noise
variables $\epsilon_{i_t}(\bar{x}_t)$, and any other possible
source of randomness, including $i_t$ (see the regret
analysis).
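
To make the regret definition concrete, here is a small Python sketch that evaluates $r_t$ and the cumulative regret in a simulation where the user vectors $u_i$ are known. This is only possible in synthetic experiments, since the learner never observes $u_i$; the function names and the triple-based round encoding are assumptions made for illustration.

```python
def dot(u, x):
    """Inner product u^T x of two equal-length vectors."""
    return sum(ui * xi for ui, xi in zip(u, x))


def instantaneous_regret(u_it, candidates, chosen):
    """r_t = max_{x in C} u^T x - u^T x_chosen for the served user's vector u."""
    best = max(dot(u_it, x) for x in candidates)
    return best - dot(u_it, candidates[chosen])


def cumulative_regret(rounds):
    """Sum r_t over rounds given as (u_it, candidates, chosen_index) triples."""
    return sum(instantaneous_regret(u, c, k) for u, c, k in rounds)


u = [1.0, 0.0]
items = [[1.0, 0.0], [0.0, 1.0]]
# One round with the best item (regret 0) and one with the worst (regret 1).
print(cumulative_regret([(u, items, 0), (u, items, 1)]))
```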


^1 As usual, this assumption may be relaxed by assuming the
existence of two thresholds, one for the within-cluster
distance of $u_i^\top x$ to $u_{i'}^\top x$, the other for the
between-cluster distance.

^2 It would in fact be possible to lift this whole machinery
to time-drifting user preferences by combining with known
techniques (e.g., [?]).


The kind of regret bound we would like to contrast to is one
where the latent clustering structure over $\mathcal{U}$
(w.r.t. the feature vectors $x$) is somehow known beforehand
(see the regret analysis for details). When the content
universe is large but known a priori, as is frequent in many
collaborative filtering applications, it is often desirable to
also group the items into clusters based on similarity of user
preferences, i.e., two items are similar if they are preferred
by many of the same users. This notion of “two-sided”
clustering is well known in the literature; when the
clustering process is simultaneously grouping users based on
similarity at the item side and items based on similarity at
the user side, it goes under the name of “co-clustering” (see,
e.g., [14, 15]). Here, we consider a computationally more
affordable notion of collaborative filtering based on adaptive
two-sided clustering.

Unlike existing clustering techniques on bandits (e.g., [?]),
our clustering setting only applies to the case when the
content universe is large but known a priori (yet, see the end
of Section 4). Specifically, let the content universe be
$\mathcal{I} = \{x_1, x_2, \ldots, x_{|\mathcal{I}|}\}$, and
$P(x_h) = \{U_1(x_h), U_2(x_h), \ldots, U_{m(x_h)}(x_h)\}$ be
the partition into clusters over the set of users $\mathcal{U}$
induced by item $x_h$. Then items $x_h, x_{h'} \in \mathcal{I}$
belong to the same cluster (over the set of items
$\mathcal{I}$) if and only if they induce the same partition of
the users, i.e., if $P(x_h) = P(x_{h'})$. We denote by $g$ the
number of distinct partitions so induced over $\mathcal{U}$ by
the items in $\mathcal{I}$, and work under the assumption that
$g$ is unknown but significantly smaller than $|\mathcal{I}|$.
(Again, the assumption $g \ll |\mathcal{I}|$ is not strictly
needed, but it both makes our algorithms effective and is
expected to be satisfied in relevant practical scenarios.)
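
The two-sided structure just defined can be illustrated with a short Python sketch that, given the (in practice unknown) user vectors, computes the partition $P(x_h)$ induced by each item and groups items with identical partitions. This is an oracle computation for intuition only, not part of the learning algorithm; the helper names and the toy data are assumptions.

```python
def dot(u, x):
    """Inner product u^T x of two equal-length vectors."""
    return sum(ui * xi for ui, xi in zip(u, x))


def induced_partition(users, x, gamma):
    """Partition user indices by expected payoff u_i^T x.

    Under the gap assumption, payoffs inside a cluster coincide and payoffs
    across clusters differ by more than gamma, so sorting by payoff and
    cutting at jumps of at least gamma recovers the clusters.
    """
    order = sorted(range(len(users)), key=lambda i: dot(users[i], x))
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if dot(users[nxt], x) - dot(users[prev], x) >= gamma:
            clusters.append(frozenset(current))
            current = []
        current.append(nxt)
    clusters.append(frozenset(current))
    return frozenset(clusters)


def item_clusters(users, items, gamma):
    """Group item indices that induce the same partition over users."""
    groups = {}
    for h, x in enumerate(items):
        groups.setdefault(induced_partition(users, x, gamma), []).append(h)
    return list(groups.values())


# Four users, three items given as canonical basis vectors.
users = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
items = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(item_clusters(users, items, gamma=0.5))  # g is the length of this list
```

In this toy instance the first two items split the users in the same way, so they share an item cluster, while the third item induces a different partition.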

Finally, in all of the above, an important special case is
when the items to be recommended do not possess specific
features (or do not possess features having significant
predictive power). In this case, it is common to resort to the
more classical non-contextual stochastic multi-armed bandit
setting (e.g., [?]), which is recovered from the contextual
framework by setting $d = |\mathcal{I}|$, and assuming the
content universe $\mathcal{I}$ is made up of the
$d$-dimensional vectors $e_h$, $h = 1, \ldots, d$, of the
canonical basis of $\mathbb{R}^d$. As a consequence, the
expected payoff of user $i$ on item $h$ is simply the $h$-th
component of vector $u_i$, and two users $i$ and $i'$ belong to
the same cluster w.r.t. $h$ if the $h$-th component of $u_i$
equals the $h$-th component of $u_{i'}$. Because the lack of
useful annotation on data was an issue with all datasets at
our disposal, it is this latter modeling assumption that
motivates the algorithm we actually implemented for the
experiments reported in Section 5.

3. RELATED WORK 

Batch collaborative filtering neighborhood methods rely on
finding similar groups of users and items to the target
user-item pair, e.g., [?], and thus in effect rely on a
dynamic form of grouping users and items. Collaborative
Filtering-based methods have also been integrated with
co-clustering techniques, whereby preferences in each
co-cluster are modeled with simple statistics of the
preference relations in the co-cluster, e.g., rating averages
[18].

Beyond the general connection to co-clustering (e.g.,
[14, 15]), our paper is related to the research on multi-armed
bandit algorithms for trading off exploration and exploitation
through dynamic clustering. We are not aware of any specific
piece of work that combines bandits with co-clustering based
on the scheme of collaborative filtering; the papers which are
most closely related to ours are [22], [?], [?]. In [?], the
authors work under the assumption that users are defined using
a feature vector, and try to learn a low-rank hidden subspace
assuming that variation across users is low-rank. The paper
combines low-rank matrix recovery with high-dimensional
Gaussian Process Bandits, but it gives rise to algorithms
which do not seem practical for sizeable problems. In [27],
the authors analyze a non-contextual stochastic bandit problem
where model parameters are assumed to be clustered in a few
(unknown) types. Yet, the provided solutions are completely
different from ours. The work [?] combines ($k$-means-like)
online clustering with a contextual bandit setting, but
clustering is only made at the user side. The paper [?] also
relies on bandit clustering at the user side (as in
[27, 29]), with an emphasis on diversifying recommendations to
the same user over time. In [?], the authors propose cascading
bandits of user behavior to identify the $k$ most attractive
items, and formulate it as a stochastic combinatorial partial
monitoring problem. Finally, the algorithms in [?] can be seen
as a special case of COFIBA when clustering is done only at
the user side, under centralized [?] or decentralized [?]
environments.

Similar in spirit are also [?], [10]: In [?], the authors
define a transfer learning problem within a stochastic
multi-armed bandit setting, where a prior distribution is
defined over the set of possible models over the tasks; in
[?], the authors rely on clustering Markov Decision Processes
based on their model parameter similarity. In [?], the authors
discuss how to choose from $n$ unknown distributions the $k$
ones whose means are largest by a certain metric; in [?], the
authors study particle Thompson sampling with
Rao-Blackwellization for online matrix factorization,
exhibiting a regret bound in a very specific case of
$n \times m$ rank-1 matrices. Yet, in none of the above cases
did the authors make a specific effort towards item-dependent
clustering models applied to stochastic multi-armed bandits.

Further work includes [25, 32]. In [25], an ensemble of
contextual bandits is used to address the cold-start problem
in recommender systems. A similar approach is used in [32] to
deal with cold-start in recommender systems but based on the
probability matching paradigm in a parameter-free bandit
strategy, which employs online bootstrap to derive the
distribution of the estimated models. In contrast to our work,
in neither [25] nor [32] are collaborative effects explicitly
taken into account.


4. THE ALGORITHM 

COFIBA relies on upper-confidence-based tradeoffs between
exploration and exploitation, combined with adaptive
clustering procedures at both the user and the item sides.
COFIBA stores in round $t$ an estimate $w_{i,t}$ of vector
$u_i$ associated with user $i \in \mathcal{U}$. Vectors
$w_{i,t}$ are updated based on the payoff feedback, as in a
standard linear least-squares approximation to the
corresponding $u_i$. Every user $i \in \mathcal{U}$ hosts such
an algorithm which operates as a linear bandit algorithm
(e.g., [12], [31], [?]) on the available content $C_{i_t}$.
More specifically, $w_{i,t-1}$ is determined by an inverse
correlation matrix $M_{i,t-1}^{-1}$ subject to rank-one
adjustments, and a vector $b_{i,t-1}$ subject to additive
updates. Matrices $M_{i,t}$ are initialized to the
$d \times d$ identity matrix, and vectors $b_{i,t}$ are
initialized to the $d$-dimensional zero vector. Matrix
$M_{i,t-1}^{-1}$ is also used to define an upper confidence
bound $\mathrm{CB}_{i,t-1}(x)$ in the approximation of
$w_{i,t-1}$ to $u_i$ along direction $x$. Based on the local
information encoded in the weight vectors $w_{i,t-1}$ and the
confidence bounds $\mathrm{CB}_{i,t-1}(x)$, the algorithm also
maintains and updates a family of clusterings of the set of
users $\mathcal{U}$, and a single clustering over the set of
items $\mathcal{I}$. On both sides, such clusterings are
represented through connected components of undirected graphs
(this is in the same vein as in [?]), where nodes are either
users or items. A pseudocode description of our algorithm is
contained in Figures 1, 2, and 3, while Figure 4 illustrates
the algorithm's behavior through a pictorial example.

At time $t$, COFIBA receives the index $i_t$ of the current
user to serve, along with the available item vectors
$x_{t,1}, \ldots, x_{t,c_t}$, and must select one among them.
In order to do so, the algorithm computes the $c_t$
neighborhood sets $N_k$, one per item $x_{t,k} \in C_{i_t}$,
based on the current aggregation of users (clusters “at the
user side”) w.r.t. item $x_{t,k}$. Set $N_k$ should be
regarded as the current approximation to the cluster (over the
users) $i_t$ belongs to when the clustering criterion is
defined by item $x_{t,k}$. Each neighborhood set then defines
a compound weight vector $\bar{w}_{N_k,t-1}$ (through the
aggregation of the corresponding matrices $M_{i,t-1}$ and
vectors $b_{i,t-1}$) which, in turn, determines a compound
confidence bound^3 $\mathrm{CB}_{N_k,t-1}(x_{t,k})$. Vector
$\bar{w}_{N_k,t-1}$ and confidence bound
$\mathrm{CB}_{N_k,t-1}(x_{t,k})$ are combined through an
upper-confidence exploration-exploitation scheme so as to
commit to the specific item $\bar{x}_t \in C_{i_t}$ for user
$i_t$. Then, the payoff $a_t$ is received, and the algorithm
uses $\bar{x}_t$ to update $M_{i_t,t-1}$ to $M_{i_t,t}$ and
$b_{i_t,t-1}$ to $b_{i_t,t}$. Notice that the update is only
performed at user $i_t$, though this will affect the
calculation of neighborhood sets and compound vectors for
other users in later rounds.
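
The selection step just described can be sketched in Python for the non-contextual special case of Section 2, where every item is a canonical basis vector $e_h$ and each $M_i$ is therefore diagonal. This is an illustrative simplification under stated assumptions, not the authors' implementation: `counts`, `sums`, `select_item`, and `update_user` are hypothetical names, and the confidence width used is $\alpha \sqrt{x^\top \bar{M}^{-1} x \log(t+1)}$ specialized to $x = e_h$.

```python
import math


def select_item(t, candidates, neighborhoods, counts, sums, alpha):
    """Return the position k_t of the candidate with the largest UCB score.

    candidates[k]    : item index h of the k-th available item (x = e_h)
    neighborhoods[k] : set of users N_k currently clustered with the served
                       user w.r.t. that item
    counts[i][h]     : times item h was served to user i, so that
                       M_i = I + diag(counts[i])
    sums[i][h]       : total payoff user i gave on item h (the vector b_i)
    """
    best_k, best_score = 0, -math.inf
    for k, h in enumerate(candidates):
        # Aggregate M_N = I + sum_{i in N}(M_i - I) and b_N = sum_{i in N} b_i,
        # restricted to coordinate h (all that matters when x = e_h).
        m = 1.0 + sum(counts[i][h] for i in neighborhoods[k])
        b = sum(sums[i][h] for i in neighborhoods[k])
        score = b / m + alpha * math.sqrt(math.log(t + 1) / m)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def update_user(i_t, h, a_t, counts, sums):
    """Rank-one update M += e_h e_h^T and additive update b += a_t e_h, at user i_t only."""
    counts[i_t][h] += 1
    sums[i_t][h] += a_t


counts = [[5, 0], [5, 0]]
sums = [[5.0, 0.0], [4.0, 0.0]]
hoods = [{0, 1}, {0, 1}]
print(select_item(10, [0, 1], hoods, counts, sums, alpha=0.0))   # pure exploitation
print(select_item(10, [0, 1], hoods, counts, sums, alpha=10.0))  # exploration dominates
```

With no exploration bonus the well-rated item is picked; with a large `alpha` the never-served item wins because its confidence width is larger.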

After receiving payoff $a_t$ and computing $M_{i_t,t}$ and
$b_{i_t,t}$, COFIBA updates the clusterings at the user side
and the (unique) clustering at the item side. In round $t$,
there are multiple graphs
$G^U_{t,h} = (\mathcal{U}, E^U_{t,h})$ at the user side (hence
many clusterings over $\mathcal{U}$, indexed by $h$), and a
single graph $G^I_t = (\mathcal{I}, E^I_t)$ at the item side
(hence a single clustering over $\mathcal{I}$). Each
clustering at the user side corresponds to a single cluster at
the item side, so that we have $g_t$ clusters over items and
$g_t$ clusterings over users (see Figure 4 for an example). On
both user and item sides, updates take the form of edge
deletions. Updates at the user side are only performed on the
graph $G^U_{t,h_t}$ pointed to by the selected item
$\bar{x}_t = x_{t,k_t}$. Updates at the item side are only
made if it is likely that the neighborhood of user $i_t$ has
significantly changed when considered w.r.t. two previously
deemed similar items. Specifically, if item $x_h$ was directly
connected to item $\bar{x}_t$ at the beginning of round $t$
and, as a consequence of edge deletion at the user side, the
set of users that are now likely to be close to $i_t$ w.r.t.
$x_h$ is no longer the same as the set of users that are
likely to be close to $i_t$ w.r.t. $\bar{x}_t$, then this is
taken as a good indication that item $x_h$ is not inducing the
same partition over users as $\bar{x}_t$ does, hence edge
$(\bar{x}_t, x_h)$ gets deleted. Notice that this need not
imply that, as a result of this deletion, the two items now
belong to different clusters over $\mathcal{I}$, since these
two items may still be indirectly connected.
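
The user-side update (edge deletion followed by recomputing connected components) can be sketched as follows. The adjacency-dictionary representation and the precomputed `est` and `cb` maps (standing for $w_{j,t}^\top \bar{x}_t$ and $\mathrm{CB}_{j,t}(\bar{x}_t)$) are assumptions made for illustration; the toy values mirror the situation where user 4 keeps neighbor 5 and loses neighbor 6.

```python
def prune_user_edges(adj, i_t, est, cb):
    """Delete edges (i_t, j) whose payoff estimates are confidently apart.

    adj : dict mapping each user to the set of its neighbors (undirected)
    est : dict, est[j] is the estimated payoff of user j on the chosen item
    cb  : dict, cb[j] is the confidence width of that estimate
    """
    for j in list(adj[i_t]):
        if abs(est[i_t] - est[j]) > cb[i_t] + cb[j]:
            adj[i_t].discard(j)
            adj[j].discard(i_t)


def connected_components(adj):
    """Return the clusters of the graph as a list of node sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            component.add(v)
            for nb in adj[v]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(component)
    return components


adj = {4: {5, 6}, 5: {4}, 6: {4}}
est = {4: 0.9, 5: 0.85, 6: 0.1}
cb = {4: 0.1, 5: 0.1, 6: 0.1}
prune_user_edges(adj, 4, est, cb)
print(connected_components(adj))  # user 6 is now split off from users 4 and 5
```

Because only edges incident to the served user are examined, the cost of a round is governed by that user's degree plus the component recomputation.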

^3 The one given in Figure 1 is the confidence bound we use in
our experiments. In fact, the theoretical counterpart to CB is
significantly more involved; similar efforts to close this gap
can also be found, e.g., in [?].


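Input:

• Set of users $\mathcal{U} = \{1, \ldots, n\}$;

• set of items $\mathcal{I} = \{x_1, \ldots, x_{|\mathcal{I}|}\} \subseteq \mathbb{R}^d$;

• exploration parameter $\alpha > 0$, and edge deletion
parameter $\alpha_2 > 0$.

Init:

• $b_{i,0} = 0 \in \mathbb{R}^d$ and $M_{i,0} = I \in \mathbb{R}^{d \times d}$, $i = 1, \ldots, n$;

• User graph $G^U_{1,1} = (\mathcal{U}, E^U_{1,1})$, $G^U_{1,1}$ is connected over $\mathcal{U}$;

• Number of user graphs $g_1 = 1$;

• No. of user clusters $m^U_{1,1} = 1$;

• Item clusters $\hat{I}_{1,1} = \mathcal{I}$, no. of item clusters $g_1 = 1$;

• Item graph $G^I_1 = (\mathcal{I}, E^I_1)$, $G^I_1$ is connected over $\mathcal{I}$.

for $t = 1, 2, \ldots, T$ do

    Set $w_{i,t-1} = M_{i,t-1}^{-1} b_{i,t-1}$, $i = 1, \ldots, n$;

    Receive $i_t \in \mathcal{U}$, and get items $C_{i_t} = \{x_{t,1}, \ldots, x_{t,c_t}\} \subseteq \mathcal{I}$;

    For each $k = 1, \ldots, c_t$, determine which cluster (within the
    current user clustering w.r.t. $x_{t,k}$) user $i_t$ belongs to, and
    denote this cluster by $N_k$;

    Compute, for $k = 1, \ldots, c_t$, aggregate quantities

        $\bar{M}_{N_k,t-1} = I + \sum_{i \in N_k} (M_{i,t-1} - I)$,

        $\bar{b}_{N_k,t-1} = \sum_{i \in N_k} b_{i,t-1}$,

        $\bar{w}_{N_k,t-1} = \bar{M}_{N_k,t-1}^{-1} \bar{b}_{N_k,t-1}$;

    Set $k_t = \arg\max_{k = 1, \ldots, c_t} \bigl( \bar{w}_{N_k,t-1}^\top x_{t,k} + \mathrm{CB}_{N_k,t-1}(x_{t,k}) \bigr)$,

        where $\mathrm{CB}_{N_k,t-1}(x) = \alpha \sqrt{x^\top \bar{M}_{N_k,t-1}^{-1} x \, \log(t+1)}$;

    Set for brevity $\bar{x}_t = x_{t,k_t}$;

    Observe payoff $a_t \in \mathbb{R}$, and update weights $M_{i,t}$ and $b_{i,t}$ as
    follows:

    • $M_{i_t,t} = M_{i_t,t-1} + \bar{x}_t \bar{x}_t^\top$,

    • $b_{i_t,t} = b_{i_t,t-1} + a_t \bar{x}_t$,

    • Set $M_{i,t} = M_{i,t-1}$, $b_{i,t} = b_{i,t-1}$ for all $i \neq i_t$;

    Determine $h_t \in \{1, \ldots, g_t\}$ such that $k_t \in \hat{I}_{h_t,t}$;

    Update user clusters at graph $G^U_{t,h_t} = (\mathcal{U}, E^U_{t,h_t})$ by
    performing the steps in Figure 2;

    For all $h \neq h_t$, set $G^U_{t+1,h} = G^U_{t,h}$;

    Update item clusters at graph $G^I_t = (\mathcal{I}, E^I_t)$ by performing
    the steps in Figure 3.

end for

Figure 1: The COFIBA algorithm.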


Update user clusters at graph $G^U_{t,h_t}$ as follows:

• Delete from $E^U_{t,h_t}$ all $(i_t, j)$ such that

    $|w_{i_t,t}^\top \bar{x}_t - w_{j,t}^\top \bar{x}_t| > \mathrm{CB}_{i_t,t}(\bar{x}_t) + \mathrm{CB}_{j,t}(\bar{x}_t)$,

  where $\mathrm{CB}_{i,t}(x) = \alpha_2 \sqrt{x^\top M_{i,t}^{-1} x \, \log(t+1)}$;

• Let $E^U_{t+1,h_t}$ be the resulting set of edges, set
  $G^U_{t+1,h_t} = (\mathcal{U}, E^U_{t+1,h_t})$, and compute associated
  clusters $\hat{U}_{1,t+1,h_t}, \hat{U}_{2,t+1,h_t}, \ldots, \hat{U}_{m^U_{t+1,h_t},t+1,h_t}$
  as the connected components of $G^U_{t+1,h_t}$.

Figure 2: User cluster update in the COFIBA algorithm.





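Update item clusters at graph $G^I_t$ as follows:

• For all $\ell$ such that $(\bar{x}_t, x_\ell) \in E^I_t$ build neighborhood
  $N^U_{\ell,t+1}(i_t)$ as:

    $N^U_{\ell,t+1}(i_t) = \{ j : j \neq i_t, \ |w_{i_t,t}^\top x_\ell - w_{j,t}^\top x_\ell| \leq \mathrm{CB}_{i_t,t}(x_\ell) + \mathrm{CB}_{j,t}(x_\ell) \}$;

• Delete from $E^I_t$ all $(\bar{x}_t, x_\ell)$ such that
  $N^U_{\ell,t+1}(i_t) \neq N^U_{t+1,h_t}(i_t)$, where $N^U_{t+1,h_t}(i_t)$ is the
  neighborhood of node $i_t$ w.r.t. graph $G^U_{t+1,h_t}$;

• Let $E^I_{t+1}$ be the resulting set of edges, set
  $G^I_{t+1} = (\mathcal{I}, E^I_{t+1})$, compute associated item clusters
  $\hat{I}_{1,t+1}, \hat{I}_{2,t+1}, \ldots, \hat{I}_{g_{t+1},t+1}$ through the
  connected components of $G^I_{t+1}$;

• For each new item cluster created, allocate a new connected
  graph over users representing a single (degenerate) cluster
  $\mathcal{U}$.

Figure 3: Item cluster update in the COFIBA algorithm.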



[Figure 4, panels (b) Round t and (c) Round t + 1: user graphs (left) and item graph (right).]


It is worth stressing that a naive implementation of COFIBA
would require memory allocation for maintaining
$|\mathcal{I}|$-many $n$-node graphs, i.e.,
$O(n^2 |\mathcal{I}|)$. Because this would be prohibitive even
for moderately large sets of users, we make full use of the
approach of [?], where instead of starting off with complete
graphs over users each time a new cluster over items is
created, we randomly sparsify the complete graph by drawing an
Erdős-Rényi initial graph, still retaining with high
probability the underlying clusterings
$\{U_1(x_h), \ldots, U_{m(x_h)}(x_h)\}$, $h = 1, \ldots, |\mathcal{I}|$,
over users. This works under the assumption that the latent
clusters $U_j(x_h)$ are not too small (see the argument in
[?], where it is shown that in practice the initial graphs can
have $O(n \log n)$ edges instead of $O(n^2)$). Moreover,
because we modify the item graph by edge deletions only, one
can show that with high probability (under the modeling
assumptions of Section 2) the number $g_t$ of clusters over
items remains upper bounded by $g$ throughout the run of
COFIBA, so that the actual storage required by the algorithm
is indeed $O(n g \log n)$. This also brings a substantial
saving in running time, since updating connected components
scales with the number of edges of the involved graphs. It is
this graph sparsification technique that we used and tested in
our experiments.
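
A rough Python sketch of the sparse initialization follows. It draws an Erdős-Rényi graph with edge probability proportional to $\log(n)/n$, which yields on the order of $n \log n$ edges in expectation. The constant `c`, the function name, and the fixed seed are assumptions chosen for illustration; the source does not state the constants used.

```python
import math
import random


def sparse_initial_graph(n, c=3.0, seed=0):
    """Draw an Erdos-Renyi graph G(n, p) with p = min(1, c * log(n) / n).

    Returns an adjacency dict {node: set(neighbors)}. The expected number
    of edges is p * n * (n - 1) / 2, i.e. O(n log n) rather than O(n^2).
    """
    p = min(1.0, c * math.log(n) / n)
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


n = 200
adj = sparse_initial_graph(n)
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
print(n_edges, "edges instead of", n * (n - 1) // 2)
```

The double loop above still takes quadratic time to generate the graph; a production implementation would sample edges directly, but the memory footprint of the result is what matters for the storage argument.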

Finally, although we have described in Section 2 a setting
where $\mathcal{I}$ and $\mathcal{U}$ are known a priori (our
regret analysis currently holds only in this scenario),
nothing prevents one in practice from adapting COFIBA to the
case when new content or new users show up. This essentially
amounts to adding new nodes to the graphs at either the item
or the user side, by maintaining data structures via dynamic
memory allocation. In fact, this is precisely how we
implemented our algorithm in the case of very big item or user
sets (e.g., the Telefonica and the Avazu datasets in the next
section).

Figure 4: In this example, $\mathcal{U} = \{1, \ldots, 6\}$ and
$\mathcal{I} = \{x_1, \ldots, x_8\}$ (the items are depicted
here as 1, 2, ..., 8). (a) At the beginning we have $g_1 = 1$,
with a single item cluster $\hat{I}_{1,1} = \mathcal{I}$ and,
correspondingly, a single (degenerate) clustering over
$\mathcal{U}$, made up of the unique cluster $\mathcal{U}$.
(b) In round $t$ we have the $g_t = 3$ item clusters
$\hat{I}_{1,t} = \{x_1, x_2\}$, $\hat{I}_{2,t} = \{x_3, x_4, x_5\}$,
$\hat{I}_{3,t} = \{x_6, x_7, x_8\}$. Corresponding to each one
of them are the three clusterings over $\mathcal{U}$ depicted
on the left, so that $m^U_{t,1} = 3$, $m^U_{t,2} = 2$, and
$m^U_{t,3} = 4$. In this example, $i_t = 4$ and
$\bar{x}_t = x_5$, hence $h_t = 2$, and we focus on graph
$G^U_{t,2}$, corresponding to user clustering
$\{\{1, 2, 3\}, \{4, 5, 6\}\}$. Suppose in $G^U_{t,2}$ the only
neighbors of user 4 are 5 and 6. When updating such user
clustering, the algorithm considers therein edges (4, 5) and
(4, 6) to be candidates for elimination. Suppose edge (4, 6)
is eliminated, so that the new clustering over $\mathcal{U}$
induced by the updated graph $G^U_{t+1,2}$ becomes
$\{\{1, 2, 3\}, \{4, 5\}, \{6\}\}$. After the user graph
update, the algorithm considers the item graph update. Suppose
$x_5$ is only connected to $x_4$ and $x_3$ in $G^I_t$, and that
$x_4$ is not connected to $x_3$, as depicted. Both edge
$(x_5, x_4)$ and edge $(x_5, x_3)$ are candidates for
elimination. The algorithm computes the neighborhood $N$ of
$i_t = 4$ according to $G^U_{t+1,2}$, and compares it to the
neighborhoods $N^U_{\ell,t+1}(i_t)$, for $\ell = 3, 4$. Assume
$N \neq N^U_{3,t+1}(i_t)$. Because the two neighborhoods of
user 4 are now different, the algorithm deletes edge
$(x_5, x_3)$ from the item graph, splitting the item cluster
$\{x_3, x_4, x_5\}$ into the two clusters $\{x_3\}$ and
$\{x_4, x_5\}$, hence allocating a new cluster at the item side
corresponding to a new degenerate clustering
$\{\{1, 2, 3, 4, 5, 6\}\}$ at the user side. (c) The resulting
clusterings at the beginning of round $t + 1$. (In this
picture it is assumed that edge $(x_5, x_4)$ was not deleted
from the item graph at time $t$.)

5. EXPERIMENTS

We compared our algorithm to standard bandit baselines on
three real-world datasets: one canonical benchmark dataset on
news recommendations, one advertising dataset from a live
production system, and one publicly available advertising
dataset. In all cases, no features on the items have been
used. We closely followed the same experimental setting as in
previous work [12], [?], thereby evaluating prediction
performance by click-through rate.

5.1 Datasets 

Yahoo!. The first dataset we use for the evaluation is the
freely available benchmark dataset which was released in the
“ICML 2012 Exploration & Exploitation Challenge”.^4 The aim of
the challenge was to build state-of-the-art news article
recommendation algorithms on Yahoo! data, by building an
algorithm that efficiently learns a policy to serve news
articles on a web site. The dataset is made up of random
traffic records of user visits on the “Today Module” of
Yahoo!, implying that both the visitors and the recommended
news article are selected randomly. The available options (the
items) correspond to a set of news articles available for
recommendation, one being displayed in a small box on the
visited web page. The aim is to recommend an interesting
article to the user, whose interest in a given piece of news
is asserted by a click on it. The data has 30 million visits
over a two-week time stretch. Out of the logged information
contained in each record, we used: the user ID in the form of
a 136-dimensional boolean vector containing his/her features
(index $i_t$); the set of relevant news articles that the
system can recommend from (set $C_{i_t}$); a randomly
recommended article during the visit; and a boolean value
indicating whether the recommended article was clicked by the
visiting user or not (payoff $a_t$). Because the displayed
article is chosen uniformly at random from the candidate
article pool, one can use an unbiased off-line evaluation
method to compare bandit algorithms in a reliable way. We
refer the reader to [?] for a more detailed description of how
this dataset was collected and extracted. We picked the larger
of the two datasets considered in [?], resulting in
$n \approx 18K$ users, and $d = 323$ distinct items. The number
of records ended up being 2.8M, out of which we took the first
300K for parameter tuning, and the rest for testing.
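
One common way to realize such an off-line evaluation on uniformly logged data is a replay-style estimator: a logged record counts only when the policy under evaluation picks the article that was actually displayed. The Python sketch below illustrates that idea; it is an assumption about the kind of procedure meant here, not a description of the exact evaluator used, and the record layout and `policy` callback are hypothetical.

```python
def replay_ctr(records, policy):
    """Estimate click-through rate of `policy` from uniformly logged records.

    records : iterable of (user, candidates, shown, clicked) tuples, where
              `shown` is the logged article and `clicked` is 0 or 1
    policy  : function (user, candidates) -> chosen article
    Records where the policy disagrees with the log are skipped, since no
    payoff was observed for the article the policy would have served.
    """
    matched = clicks = 0
    for user, candidates, shown, clicked in records:
        if policy(user, candidates) == shown:
            matched += 1
            clicks += clicked
            # A learning policy would be updated with (user, shown, clicked) here.
    return clicks / matched if matched else 0.0


log = [
    ("u1", [1, 2, 3], 1, 1),
    ("u2", [1, 2, 3], 2, 1),
    ("u3", [1, 2], 1, 0),
]
print(replay_ctr(log, lambda user, cands: cands[0]))
```

Only the first and third records match the always-pick-first policy in this toy log, so the estimate is based on two impressions.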

^ https://explochallenge.inria.fr/category/challenge

Telefonica. This dataset was obtained from Telefonica S.A.,
which is the number one Spanish broadband and telecommunications
provider, with business units in Europe and South America. This
data contains clicks on ads displayed to users on one of the
websites that Telefonica operates. The data were collected from
the back-end server logs, and consist of two files: the first
file contains the ad interactions (each record containing an
impression timestamp, a user-ID, an action, the ad type, the
order item ID, and the click timestamp); the second file contains
the ad metadata as item-ID, type-ID, type, order-ID, creative
type, mask, cost, creator-ID, transaction key, cap type. Overall,
the number $n$ of users was in the scale of millions, while the
number $d$ of items was approximately 300. The data contains 15M
records, out of which we took the first 1.5M for parameter
tuning, and the rest for testing. Again, the only available
payoffs are those associated with the items served by the system.
Hence, in order to make the procedure an effective estimator in
a sequential decision process (e.g., [13], [17], [?]), we
simulated random choices by the system by generating the
available item sets $C_{i_t}$ as follows. At each round $t$, we
stored the ad served to the current user $i_t$ and the associated
payoff value $a_t$ (1 = “clicked”, 0 = “not clicked”). Then we
created $C_{i_t}$ by including the served ad along with 9 extra
items (hence $c_t = 10$ for all $t$) which were drawn uniformly at
random in such a way that, for any item $e_h \in \mathcal{I}$, if
$e_h$ occurs in some set $C_{i_t}$, this item will be the one
served by the system 1/10 of the times. The random selection was
done independently of the available payoff values $a_t$. All our
experiments on this dataset were run on a machine with 64GB RAM
and 32 Intel Xeon cores.
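The candidate-set construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_candidate_set`, its parameters, and the use of the standard `random` module are hypothetical and are not taken from the original experimental code.

```python
import random


def build_candidate_set(served_item, all_items, size=10, rng=None):
    """Return `size` distinct items containing `served_item`.

    Hypothetical helper: the extra items are drawn uniformly at random
    (without replacement) from the remaining items, without looking at
    any payoff value.
    """
    rng = rng or random.Random()
    others = [item for item in all_items if item != served_item]
    if len(others) < size - 1:
        raise ValueError("not enough items to fill the candidate set")
    candidates = [served_item] + rng.sample(others, size - 1)
    # Shuffle so that the position in the list does not reveal the served ad.
    rng.shuffle(candidates)
    return candidates


# Example: roughly 300 items, as in the Telefonica data; item 42 was served.
example_rng = random.Random(0)
example_set = build_candidate_set(42, list(range(300)), size=10, rng=example_rng)
```

Drawing the 9 extra items independently of the logged payoff is what keeps the simulated choice uninformative about the click; the same helper with `size=20` would mirror the list length used for the Avazu data.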

Avazu. This dataset was prepared by Avazu Inc., which is a
leading multinational corporation in the digital advertising
business. The data was provided for the challenge to predict the
click-through rate of impressions on mobile devices, i.e.,
whether a mobile ad will be clicked or not. The number of samples
was around 40M, out of which we took the first 4M for parameter
tuning, and the remaining 36M for testing. Each line in the data
file represents the event of an ad impression on the site or in
a mobile application (app), along with additional context
information. Again, payoff $a_t$ is binary. The variables
contained in the dataset for each sample are the following:
ad-ID; timestamp (date and hour); click (boolean variable);
device-ID; device IP; connection type; device type; ID of visited
App/Website; category of visited App/Website; connection domain
of visited App/Website; banner position; anonymized categorical
fields (C1, C14-C21). We pre-processed the dataset as follows: we
cleaned up the data by filtering out the records having missing
feature values, and removed outliers. We identified the user
with the device-ID, whenever it is not null. The number of users
in this dataset is in the scale of millions. Similar to the
Telefonica dataset, we generated recommendation lists of length
$c_t = 20$ for each distinct timestamp. All data were transferred
to Amazon S3, and all jobs were run through the Amazon EC2 Web
Service.

5.2 Algorithms 

We compared COFIBA to a number of state-of-the-art 
bandit algorithms: 

• LINUCB-ONE is a single instance of the UCB1 algorithm,
which is a very popular and established algorithm that has
received a lot of attention in the research community over
the past years;

• DYNUCB is the dynamic UCB algorithm of [?]. This algorithm
adopts a “K-means”-like clustering technique so as to
dynamically re-assign the clusters on the fly based on the
changing contexts and user preferences over time;

• LINUCB-IND is a set of independent UCB1 instances, one per
user, which provides a fully personalized recommendation for
each user;

• CLUB is the state-of-the-art online clustering of bandits
algorithm that dynamically clusters users based on the
confidence ellipsoids of their models;

• LINUCB-V is also a single instance of UCB1, but with a more
sophisticated confidence bound; this algorithm turned out to
be the winner of the “ICML 2012 Challenge” where the Yahoo!
dataset originates from.
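To make the difference between a single shared learner (as in LINUCB-ONE) and one learner per user (as in LINUCB-IND) concrete, the following minimal Python sketch implements a UCB-style index for the one-hot setting, where every item is a canonical basis vector. The class name `OneHotUCB`, the parameter `alpha`, and the particular confidence width are illustrative assumptions; they are not the exact implementations of the competitors listed above.

```python
import math


class OneHotUCB:
    """UCB-style learner for one-hot items (hypothetical sketch).

    With one-hot item vectors, a ridge-regularised linear model decouples
    into one running average per item, so no matrix algebra is needed.
    """

    def __init__(self, num_items, alpha=0.1):
        self.alpha = alpha                    # exploration parameter (assumed name)
        self.counts = [0] * num_items         # number of times each item was played
        self.payoff_sums = [0.0] * num_items  # cumulative payoff per item
        self.rounds = 0

    def _index(self, item):
        denom = 1.0 + self.counts[item]       # identity regulariser plus play count
        estimate = self.payoff_sums[item] / denom
        width = self.alpha * math.sqrt(math.log(self.rounds + 2) / denom)
        return estimate + width

    def select(self, candidates):
        """Pick the candidate item with the largest upper confidence index."""
        return max(candidates, key=self._index)

    def update(self, item, payoff):
        self.counts[item] += 1
        self.payoff_sums[item] += payoff
        self.rounds += 1


# One shared instance serves every user; a per-user variant would instead
# keep a dictionary {user_id: OneHotUCB(...)} and route each round to it.
shared = OneHotUCB(num_items=3, alpha=0.0)
shared.update(1, 1.0)
greedy_choice = shared.select([0, 1, 2])
```

With `alpha=0.0` the learner is purely greedy and picks the only item that has earned a click; a large `alpha` pushes it toward items it has not tried yet.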

® https://www.kaggle.com/c/avazu-ctr-prediction

Figure 5: Results on the Yahoo dataset.

We tuned the optimal parameters in the training set with a
standard grid search as indicated in [13], [1], and used the test
set to evaluate the predictive performance of the algorithms.
Since the system’s recommendation need not coincide with the
recommendation issued by the algorithms we tested, we only
retained the records on which the two recommendations were indeed
the same. Because records are discarded on the fly, the actual
number $T$ of retained records (“Rounds” in the plots of the next
subsection) changes slightly across algorithms; $T$ was around
70K for the Yahoo! data, 350K for the Telefonica data, and 900K
for the Avazu data. All experimental results we report were
averaged over 3 runs (but in fact the variance we observed across
these runs was fairly small).
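The record-retention rule described above can be illustrated with a short replay loop. This is a minimal sketch under assumptions: the record layout `(user, candidates, shown_item, clicked)`, the function name `replay_ctr`, and the toy `FirstItemPolicy` are hypothetical and only serve to show the mechanics, not the evaluation code actually used.

```python
class FirstItemPolicy:
    """Toy policy (hypothetical): always recommends the first candidate."""

    def select(self, user, candidates):
        return candidates[0]

    def update(self, user, item, payoff):
        pass  # a real bandit algorithm would update its model here


def replay_ctr(policy, logged_records):
    """Replay logged records and return (CTR, number of retained records).

    A record is retained only when the policy picks the very item that the
    logging system displayed; all other records are discarded on the fly.
    """
    retained = 0
    clicks = 0
    for user, candidates, shown_item, clicked in logged_records:
        if policy.select(user, candidates) != shown_item:
            continue
        policy.update(user, shown_item, clicked)
        retained += 1
        clicks += clicked
    ctr = clicks / retained if retained else 0.0
    return ctr, retained


toy_log = [
    ("u1", [1, 2], 1, 1),  # policy picks 1, system showed 1: retained, clicked
    ("u1", [1, 2], 2, 1),  # policy picks 1, system showed 2: discarded
    ("u2", [3, 4], 3, 0),  # policy picks 3, system showed 3: retained, no click
]
toy_ctr, toy_rounds = replay_ctr(FirstItemPolicy(), toy_log)
```

Because the number of matches depends on the policy's own choices, the count of retained records naturally differs slightly from one algorithm to another.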

5.3 Results 

Our results are summarized in Figures 5, 6, and 7. Further
evidence is contained in Figure 8. In Figures 5-7, we plotted
click-through rate (“CTR”) vs. retained records so far
(“Rounds”). All these experiments are aimed at testing the
performance of the various bandit algorithms in terms of
prediction performance, also in cold-start regimes (i.e., the
first relatively small fraction of the time horizon on the
x-axis). Our experimental setting is in line with previous ones
(e.g., [12], [1]) and, by the way the data have been prepared,
gives rise to a reliable estimation of actual CTR behavior under
the same experimental conditions as in [12], [1]. Figure 8 is
aimed at supporting the theoretical model of Section [?] by
providing some evidence on the kind of clustering statistics
produced by COFIBA at the end of its run.

Whereas the three datasets we took into consideration are 
all generated by real online web applications, it is worth 
pointing out that these datasets are indeed different in the 
way customers consume the associated content. Generally
speaking, the longer the lifecycle of an item and the fewer the
items, the higher the chance that users with similar preferences
will consume it, and hence the stronger the collaborative effects
contained in the data. It is therefore reasonable to
expect that our algorithm will be more effective in datasets 
where the collaborative effects are indeed strong. 

The users in the Yahoo! data (Figure 5) are likely to span a
wide range of demographic characteristics; on top of this, this
dataset is derived from the consumption of news that is often
interesting for large portions of these users and, as such, does
not create strong polarization into subcommunities. This implies
that, more often than not, there are quite a few specific hot
news items that all users might express interest in, and it is
natural to expect that these pieces of news are intended to reach
a wide audience of consumers. Given this state of affairs, it is
not surprising that on the Yahoo! dataset both LINUCB-ONE and
LINUCB-V (serving the same news to all users) are already
performing quite well, thereby making the clustering-of-users
effort somewhat less useful. This also explains the poor
performance of LINUCB-IND, which is not performing any clustering
at all. Yet, even in this non-trivial case, COFIBA can still
achieve a significantly increased prediction accuracy compared,
e.g., to CLUB, thereby suggesting that simultaneous clustering at
both the user and the item (the news) sides might be an even more
effective strategy to earn clicks in news recommendation systems.

Figure 6: Results on the Telefonica dataset.

Figure 7: Results on the Avazu dataset.

Most of the users in the Telefonica data are from a diverse
sample of people in Spain, and it is easy to imagine that this
dataset spans a large number of communities across its
population. Thus we can assume that collaborative effects will be
much more evident, and that COFIBA will be able to leverage these
effects efficiently. In this dataset, CLUB performs well in
general, while DYNUCB deteriorates in the initial stage and
catches up later on. COFIBA seems to surpass all other
algorithms, especially in the cold-start regime, all other
algorithms being in the same ballpark as CLUB.

Finally, the Avazu data is furnished from its professional
digital advertising solution platform, where the customers click
the ad impressions via the iOS/Android mobile apps or through
websites, serving either the publisher or the advertiser, which
leads to a high daily volume of internet traffic. In this
dataset, neither LINUCB-ONE nor LINUCB-IND displayed a
competitive cold-start performance. DYNUCB is underperforming
throughout, while LINUCB-V demonstrates a relatively high CTR.
CLUB is strong at the beginning, but then its CTR performance
degrades. On the other hand, COFIBA seems to work extremely well
during the cold-start, and comparatively best in all later
stages.

Figure 8: A typical distribution of cluster sizes over users for
the Yahoo dataset. Each bar plot corresponds to a cluster at the
item side. We have 5 plots since this is the number of clusters
over the items that COFIBA ended up with after sweeping once over
this dataset in the run at hand. Each bar represents the fraction
of users contained in the corresponding cluster. For instance,
the first cluster over the items generated 16 clusters over the
users (bar plot on top), with relative sizes 31%, 15%, 12%, etc.
The second cluster over the items generated 10 clusters over the
users (second bar plot from top) with relative sizes 61%, 12%,
9%, etc. The relative size of the 5 clusters over the items is as
follows: 83%, 10%, 4%, 2%, and 1%, so that the clustering pattern
depicted in the top plot applies to 83% of the items, the second
one to 10% of the items, and so on.

In Figure 8 we give a typical distribution of cluster sizes
produced by COFIBA at the end of its run. The emerging pattern is
always the same: we have few clusters over the items with very
unbalanced sizes and, corresponding to each item cluster, we have
few clusters over the users, again with very unbalanced sizes.
This recurring pattern is in fact the motivation behind our
theoretical assumptions (Section [?]), and a property of data
that the COFIBA algorithm can provably take advantage of
(Section 6). These bar plots, combined with the comparatively
good performance of COFIBA, suggest that our datasets do actually
possess clusterability properties at both sides.

° Without loss of generality, we take the first Yahoo dataset
to provide statistics, for similar shapes of the bar plots can be
established for the remaining ones.

To summarize, despite the differences in the three datasets, the
experimental evidence we collected on them is quite consistent,
in that in all three cases COFIBA significantly outperforms all
other competing methods we tested. This is especially noticeable
during the cold-start period, but the same relative behavior
essentially shows up during the whole time window of our
experiments. COFIBA is a bit involved to implement, as contrasted
to its competitors, and is also somewhat slower to run
(unsurprisingly slower than, say, LINUCB-ONE and LINUCB-IND). On
the other hand, COFIBA is far more effective in exploiting the
collaborative effects embedded in the data, and still amenable to
be run on large datasets.


6. REGRET ANALYSIS 

The following theorem is the theoretical guarantee of COFIBA,
where we relate the cumulative regret of COFIBA to the clustering
structure of users $\mathcal{U}$ w.r.t. items $\mathcal{I}$. For
simplicity of presentation, we formulate our result in the
one-hot encoding case, where $u_i \in \mathbb{R}^d$,
$i = 1, \ldots, n$, and $\mathcal{I} = \{e_1, \ldots, e_d\}$. In
fact, a more general statement can be proven which holds in the
case when $\mathcal{I}$ is a generic set of feature vectors
$\mathcal{I} = \{x_1, \ldots, x_{|\mathcal{I}|}\}$, and the
regret bound depends on the geometric properties of such vectors.

In order to obtain a provable advantage from our clusterability
assumptions, extra conditions are needed on the way $i_t$ and
$C_{i_t}$ are generated. The clusterability assumptions we can
naturally take advantage of are those where, for most partitions
$P(e_h)$, the relative sizes of clusters over users are highly
unbalanced. Translated into more practical terms, cluster
unbalancedness amounts to saying that the universe of items
$\mathcal{I}$ tends to influence users so as to determine a small
number of major common behaviors (which need neither be the same
nor involve the same users across items), along with a number of
minor ones. As we saw in our experiments, this seems like a
frequent behavior of users in some practical scenarios.

Theorem 1. Let the COFIBA algorithm of Figure [?] be run on a
set of users $\mathcal{U} = \{1, \ldots, n\}$ with associated
profile vectors $u_1, \ldots, u_n \in \mathbb{R}^d$, and set of
items $\mathcal{I} = \{e_1, \ldots, e_d\}$ such that the $h$-th
induced partition $P(e_h)$ over $\mathcal{U}$ is made up of $m_h$
clusters of cardinality $v_{h,1}, v_{h,2}, \ldots, v_{h,m_h}$,
respectively. Moreover, let $g$ be the number of distinct
partitions so obtained. At each round $t$, let $i_t$ be generated
uniformly at random from $\mathcal{U}$. Once $i_t$ is selected,
the number $c_t$ of items in $C_{i_t}$ is generated arbitrarily
as a function of past indices $i_1, \ldots, i_{t-1}$, payoffs
$a_1, \ldots, a_{t-1}$, and sets $C_{i_1}, \ldots, C_{i_{t-1}}$,
as well as the current index $i_t$. Then the sequence of items in
$C_{i_t}$ is generated i.i.d. (conditioned on $i_t$, $c_t$ and
all past indices $i_1, \ldots, i_{t-1}$, payoffs
$a_1, \ldots, a_{t-1}$, and sets $C_{i_1}, \ldots, C_{i_{t-1}}$)
according to a given but unknown distribution $\mathcal{D}$ over
$\mathcal{I}$. Let payoff $a_t$ lie in the interval $[-1, 1]$,
and be generated as described in Section [?] so that, conditioned
on history, the expectation of $a_t$ is $u_{i_t}^\top x_t$.
Finally, let parameters $\alpha$ and $\alpha_2$ be suitable
functions of $\log(1/\delta)$. If $c_t \le c$ for all $t$ then,
as $T$ grows large, with probability at least $1 - \delta$ the
cumulative regret satisfies

$$\sum_{t=1}^{T} r_t = O\left( \left( \mathbb{E}[S] + \sqrt{c\,\sqrt{mn}\,\mathrm{VAR}(S)} + 1 \right) \sqrt{\frac{dT}{n}} \right),$$

where $S = S(h) = \sum_{j=1}^{m_h} \sqrt{v_{h,j}}$, $h$ is a
random index such that $e_h \sim \mathcal{D}$, and
$\mathbb{E}[\cdot]$ and $\mathrm{VAR}(\cdot)$ denote,
respectively, the expectation and the variance w.r.t. this random
index.

^ In addition, the function CB should be modified so as to
incorporate these properties.

® Any distribution having positive probability on each
$i \in \mathcal{U}$ would suffice here.



To get a feeling of how big (or small) $\mathbb{E}[S]$ and
$\mathrm{VAR}(S)$ can be, let us consider the case where each
partition over users has a single big cluster and a number of
small ones. To make it clear, consider the extreme scenario where
each $P(e_h)$ has one cluster of size $v_{h,1} = n - (m - 1)$,
and $m - 1$ clusters of size $v_{h,j} = 1$, with
$m \le \sqrt{n}$. Then it is easy to see that
$\mathbb{E}[S] = \sqrt{n - (m - 1)} + m - 1$, and
$\mathrm{VAR}(S) = 0$, so that the resulting regret bound
essentially becomes $O(\sqrt{dT})$, which is the standard regret
bound one achieves for learning a single $d$-dimensional user
(aka, the standard noncontextual bandit bound with $d$ actions
and no gap assumptions among them). At the other extreme lies the
case when each partition $P(e_h)$ has $n$-many clusters, so that
$\mathbb{E}[S] = n$, $\mathrm{VAR}(S) = 0$, and the resulting
bound is $O(\sqrt{dnT})$. Looser upper bounds can be achieved in
the case when $\mathrm{VAR}(S) > 0$, where also the interplay
with $c$ starts becoming relevant. Finally, observe that the
number $g$ of distinct partitions influences the bound only
indirectly through $\mathrm{VAR}(S)$. Yet, it is worth repeating
here that $g$ plays a crucial role in the computational (both
time and space) complexity of the whole procedure.
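The two extreme cases discussed above can be checked numerically. The sketch below computes $S(h)$, its mean and variance, and the leading factor of the bound with constants and logarithmic terms dropped. The helper names are hypothetical, and taking $m$ as the largest number of clusters in any partition is an assumption made only for this illustration.

```python
import math


def s_value(cluster_sizes):
    """S(h): sum over the clusters of one partition of sqrt(cluster size)."""
    return sum(math.sqrt(v) for v in cluster_sizes)


def mean_and_variance(values, probs):
    """Mean and variance of a finite distribution over `values`."""
    mean = sum(p * x for x, p in zip(values, probs))
    var = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))
    return mean, var


def bound_factor(partitions, probs, c, d, T):
    """(E[S] + sqrt(c*sqrt(m*n)*VAR(S)) + 1) * sqrt(d*T/n), no constants/logs."""
    n = sum(partitions[0])               # every partition covers all n users
    m = max(len(p) for p in partitions)  # assumption: largest cluster count
    mean, var = mean_and_variance([s_value(p) for p in partitions], probs)
    return (mean + math.sqrt(c * math.sqrt(m * n) * var) + 1.0) * math.sqrt(d * T / n)


n_users, m_clusters = 100, 5
# One big cluster plus m - 1 singletons: S = sqrt(n - (m - 1)) + m - 1.
unbalanced = [n_users - (m_clusters - 1)] + [1] * (m_clusters - 1)
# n singleton clusters: S = n.
singletons = [1] * n_users

factor_unbalanced = bound_factor([unbalanced, unbalanced], [0.5, 0.5], c=10, d=2, T=5000)
factor_singletons = bound_factor([singletons, singletons], [0.5, 0.5], c=10, d=2, T=5000)
```

In both extremes all partitions share the same $S(h)$, so the variance term vanishes and only $\mathbb{E}[S]$ drives the difference between the two factors.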

® The $O$-notation hides logarithmic factors in $n$, $m$, $g$,
$T$, $d$, $1/\delta$, as well as terms which are independent of
$T$.

Because $\mathcal{I} = \{e_1, \ldots, e_d\}$, the minimal
eigenvalue $\lambda$ of the process correlation matrix
$\mathbb{E}[X]$ in [?] is here $1/d$. Moreover, compared to [?],
we do not strive to capture the geometry of the user vectors
$u_i$ in the regret bound, hence we do not have the extra
$\sqrt{m}$ factor occurring in their bound.

Proof of Theorem 1. The proof sketch builds on the analysis in
[?]. Let the true underlying clusters over the users be
$V_{h,1}, V_{h,2}, \ldots, V_{h,m_h}$, with $|V_{h,j}| = v_{h,j}$.
In [?], the authors show that, because each user $i$ has
probability $1/n$ to be the one served in round $t$, we have,
with high probability, $w_{i,t} \to u_i$ for all $i$, as $t$
grows large. Moreover, because of the gap assumption involving
parameter $\gamma$, all edges connecting users belonging to
different clusters at the user side will eventually be deleted
(again, with high probability), after each user $i$ is served at
least $O(1/\gamma^2)$ times. By the way edges are disconnected at
the item side, the above is essentially independent (up to log
factors due to union bounds) of which graph at the user side we
are referring to. In turn, this entails that the current user
clusters encoded by the connected components of the user-side
graph will eventually converge to the $m_h$ true user clusters
(again, independent of $h$, up to log factors), so that the
aggregate weight vectors $\bar{w}_{N_k,t-1}$ computed by the
algorithm for trading off exploration vs. exploitation in round
$t$ will essentially converge to $u_{i_t}$ at a rate of the form

$$\mathbb{E}\left[ \frac{1}{\sqrt{1 + T_{h_t,j_t,t-1}/d}} \right],$$

where $h_t$ is the index of the true cluster over items that
$x_t$ belongs to, $j_t$ is the index of the true cluster over
users that $i_t$ belongs to (according to the partition of
$\mathcal{U}$ determined by $h_t$), $T_{h_t,j_t,t-1}$ is the
number of rounds so far where we happened to “hit” cluster
$V_{h_t,j_t}$, i.e.,

$$T_{h_t,j_t,t-1} = \left| \{ s \le t-1 \,:\, i_s \in V_{h_t,j_t} \} \right|,$$

and the expectation is w.r.t. both the (uniform) distribution of
$i_t$, and distribution $\mathcal{D}$ generating the items in
$C_{i_t}$, conditioned on all past events. Since, by the
Azuma-Hoeffding inequality, $T_{h_t,j_t,t-1}$ concentrates as
It is the latter expression that rules the cumulative regret of
COFIBA in that, up to log factors:

Eq. [?] is essentially (up to log factors and omitted additive
terms) the regret bound one would obtain by knowing beforehand
the latent clustering structure over $\mathcal{U}$.
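The concentration and summation steps referred to in this part of the proof can be sketched as follows. This is a reconstruction under the theorem's sampling assumptions (user $i_t$ uniform over the $n$ users, minimal eigenvalue $1/d$), with constants and logarithmic factors dropped; it is meant as a plausible reading of the argument, not as the exact original displays. Here $T_{h_t,j_t,t-1}$ denotes the number of past rounds whose user fell in the true cluster $V_{h_t,j_t}$ of size $v_{h_t,j_t}$.

```latex
% Step 1 (concentration): each round hits V_{h_t,j_t} with probability
% v_{h_t,j_t}/n, so the hit count stays close to its mean.
T_{h_t,j_t,t-1} \;\approx\; \frac{(t-1)\, v_{h_t,j_t}}{n}

% Step 2 (averaging over the user): cluster j is the one hit with
% probability v_{h_t,j}/n, hence
\mathbb{E}\left[ \frac{1}{\sqrt{1 + T_{h_t,j_t,t-1}/d}} \right]
  \;\approx\;
  \mathbb{E}_{\mathcal{D}}\left[ \sum_{j=1}^{m_{h_t}} \frac{v_{h_t,j}}{n}\,
     \frac{1}{\sqrt{1 + \frac{(t-1)\, v_{h_t,j}}{d\,n}}} \right]

% Step 3 (summing over rounds): \sum_{t \le T} (1 + a(t-1))^{-1/2}
% is of order \sqrt{T/a}, applied with a = v_{h_t,j}/(d n), which gives
\sum_{t=1}^{T} r_t \;\lesssim\;
  \mathbb{E}_{\mathcal{D}}\left[ \sum_{j=1}^{m_{h_t}} \sqrt{v_{h_t,j}} \right]
  \sqrt{\frac{dT}{n}}
```

The last line has the same shape as the bound of Theorem 1 when the variance term is ignored, which is consistent with the two extreme cases discussed after the theorem.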

Because $h_t$ is itself a function of the items in $C_{i_t}$, we
can eliminate the dependence on $h_t$ by the following simple
stratification argument. First of all, notice that

Then, we set for brevity $S(h) = \sum_{j=1}^{m_h} \sqrt{v_{h,j}}$,
and let $h_{t,k}$ be the index of the true cluster over items
that $x_{t,k}$ belongs to (recall that $h_{t,k}$ is a random
variable since so is $x_{t,k}$). Since
$S(h_{t,k}) \le \sqrt{mn}$, a standard argument shows that

$$\mathbb{E}_{\mathcal{D}}[S(h_t)] \le \mathbb{E}_{\mathcal{D}}\Big[ \max_{k=1,\ldots,c_t} S(h_{t,k}) \Big] \le \mathbb{E}_{\mathcal{D}}[S(h_{t,1})] + \sqrt{c\,\sqrt{mn}\,\mathrm{VAR}_{\mathcal{D}}(S(h_{t,1}))} + 1,$$

so that, after some overapproximations, we conclude that
$\sum_{t=1}^{T} r_t$ is upper bounded with high probability by

$$O\left( \left( \mathbb{E}_{\mathcal{D}}[S(h)] + \sqrt{c\,\sqrt{mn}\,\mathrm{VAR}_{\mathcal{D}}(S(h))} + 1 \right) \sqrt{\frac{dT}{n}} \right),$$

the expectation and the variance being over the random index $h$
such that $e_h \sim \mathcal{D}$. □
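As a numerical sanity check of the stratification step (not a proof), one can compute the expected maximum of $c$ i.i.d. draws of $S(h)$ exactly for a small finite distribution and compare it with the right-hand side $\mathbb{E}[S] + \sqrt{c\sqrt{mn}\,\mathrm{VAR}(S)} + 1$. The function names and the toy distribution below are illustrative assumptions.

```python
import math


def expected_max(values, probs, c):
    """Exact E[max of c i.i.d. draws] from a finite distribution.

    Uses P(max <= x) = F(x)^c, where F is the cumulative distribution.
    """
    total, cdf_prev = 0.0, 0.0
    for value, prob in sorted(zip(values, probs)):
        cdf = cdf_prev + prob
        total += value * (cdf ** c - cdf_prev ** c)
        cdf_prev = cdf
    return total


def stratification_bound(values, probs, c, m, n):
    """E[S] + sqrt(c * sqrt(m*n) * VAR(S)) + 1 for a finite distribution."""
    mean = sum(p * s for s, p in zip(values, probs))
    var = sum(p * (s - mean) ** 2 for s, p in zip(values, probs))
    return mean + math.sqrt(c * math.sqrt(m * n) * var) + 1.0


# Toy case with n = 100 users: one partition is a single cluster of 100
# users (S = 10), the other has m = 4 clusters of 25 users each (S = 20).
s_values, s_probs = [10.0, 20.0], [0.5, 0.5]
lhs = expected_max(s_values, s_probs, c=2)
rhs = stratification_bound(s_values, s_probs, c=2, m=4, n=100)
```

In this toy case the expected maximum is 17.5 while the right-hand side is about 47.6, so the inequality holds with room to spare, as expected from the overapproximations mentioned in the proof.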


7. CONCLUSIONS 

We have initiated an investigation of collaborative filtering
bandit algorithms operating in relevant scenarios where multiple
users can be grouped by behavior similarity in different ways
w.r.t. items and, in turn, the universe of items can possibly be
grouped by the similarity of clusterings they induce over users.
We carried out an extensive experimental comparison with very
encouraging results, and have also given a regret analysis which
operates in a simplified scenario. Our algorithm can in principle
be modified so as to be combined with any standard clustering (or
co-clustering) technique. However, one advantage of encoding
clusters as connected components of graphs (at least at the user
side) is that we are quite effective in tackling the so-called
cold start problem, for the newly served users are more likely to
be connected to the old ones, which puts COFIBA in a position to
automatically propagate information from the old users to the new
ones through the aggregate vectors $\bar{w}_{N_k,t-1}$. In fact,
so far we have not seen any other way of adaptively clustering
users and items which is computationally affordable on sizeable
datasets and, at the same time, amenable to a regret analysis
that takes advantage of the clustering assumption.

All our experiments have been conducted in the setup of one-hot
encoding, since the datasets at our disposal did not come with
reliable/useful annotations on data. Yet, the algorithm we
presented can clearly work when the items are accompanied by
(numerical) features. One direction of our future research is to
compensate for the lack of features in the data by first
inferring features during an initial training phase through
standard matrix factorization techniques, and subsequently
applying our algorithm to a universe of items $\mathcal{I}$
described through such inferred features. Another line of
experimental research would be to combine different bandit
algorithms (possibly at different stages of the learning process)
so as to roughly get the best of all of them in all stages. This
would be somewhat similar to the meta-bandit construction
described in [?]. Another one would be to combine with matrix
factorization techniques as in [?].